Genome-wide association studies and genomic selection assays made in a large sample of cacao (Theobroma cacao L.) germplasm reveal significant marker-trait associations and good predictive value for improving yield potential

A genome-wide association study (GWAS) was undertaken to unravel marker-trait associations (MTAs) between SNP markers and phenotypic traits. It involved a subset of 421 cacao accessions from the large and diverse collection conserved ex situ at the International Cocoa Genebank Trinidad. A Mixed Linear Model (MLM) in TASSEL was used for the GWAS and followed by confirmatory analyses using GAPIT FarmCPU. An average linkage disequilibrium (r2) of 0.10 at 5.2 Mb was found across several chromosomes. Seventeen significant (P ≤ 8.17 × 10−5 (–log10 (p) = 4.088)) MTAs of interest, including six that pertained to yield-related traits, were identified using TASSEL MLM. The latter accounted for 5 to 17% of the phenotypic variation expressed. The highly significant association (P ≤ 8.17 × 10−5) between seed length to width ratio and TcSNP 733 on chromosome 5 was verified with FarmCPU (P ≤ 1.12 × 10−8). Fourteen MTAs were common to both the TASSEL and FarmCPU models at P ≤ 0.003. The most significant yield-related MTAs involved seed number and seed length on chromosome 7 (P ≤ 1.15 × 10−14 and P ≤ 6.75 × 10−05, respectively) and seed number on chromosome 1 (P ≤ 2.38 × 10−05), based on the TASSEL MLM. It was noteworthy that seed length, seed length to width ratio and seed number were associated with markers at different loci, indicating their polygenic nature. Approximately 40 candidate genes that encode embryo and seed development, protein synthesis, carbohydrate transport and lipid biosynthesis and transport were identified in the flanking regions of the significantly associated SNPs and in linkage disequilibrium with them. A significant association of fruit surface anthocyanin intensity co-localised with MYB-related protein 308 on chromosome 4. Testing of a genomic selection approach revealed good predictive value (genomic estimated breeding values (GEBV)) for economic traits such as seed number (GEBV = 0.611), seed length (0.6199), seed width (0.5435), seed length to width ratio (0.5503), seed/cotyledon mass (0.6014) and ovule number (0.6325). The findings of this study could facilitate genomic selection and marker-assisted breeding of cacao thereby expediting improvement in the yield potential of cacao planting material.

Introduction Cacao, Theobroma cacao L., Malvaceae sensu lato [1], is an important Neotropical, perennial crop, on which the thriving global cocoa and chocolate industry is based. The World Cocoa Foundation, in 2012, reported that 40-50 million people worldwide depend on cocoa for their livelihood. The global export value of cocoa beans has been fluctuating between 8 and 10.5 billion USD over the past decade and the global chocolate market was valued at USD 106.6 billion in 2019 [2]. T. cacao is a diploid (2n = 20), allogamous species. Its genome is small, reported by Argout et al. [3] to span 411-494 Mb. Its putative centre of genetic diversity is at the headwaters of the Amazon River, South America [4], and it is indigenous to the Amazon and Orinoco rainforests.
In the current context of climate change and disease evolution, it is important to acquire a comprehensive understanding of the genetic architecture of traits of interest such as disease and pest resistance, high yield, desirable flavour, favourable flavanol (antioxidant) content, self-compatibility, and other traits of economic relevance or with potential health benefits [5][6][7][8]. Of these, yield and disease resistance traits are mainly polygenic [9,10] and tagging such traits with molecular markers will facilitate early selection for the desired genotypes and formulation of breeding strategies to create new, adapted varieties. The considerable progress made in cacao genomics in the last decade [7,, along with the increase in the number of available markers, will help to accelerate progress in improvement of cacao planting material.
Genome-wide association studies (GWAS) have been used widely in breeding to detect significant population-wide associations between genetic markers, such as Single Nucleotide Polymorphisms (SNPs), and causal polymorphisms, with functional roles in controlling traits of interest like yield potential [35,36]. The search for genotype-phenotype correlations is conducted in unrelated individuals [37,38], and is based on non-random association of alleles in the population and linkage disequilibrium (LD) [39,40]. Robust GWAS models, using population structure and kinship matrices as co-variates, facilitate the identification of mainly nonspurious associations [36].
This application of GWAS in T. cacao L. is facilitated by the mapping of the cacao genome and several previous studies. Allegre et al. [33] and Fouet et al. [32] identified and mapped SNPs and microsatellite (SSR) markers useful as expressed sequence tags (ESTs) and constructed a high-density genetic map for T. cacao. Both kinds of markers are co-dominant and thus powerful for genetic analysis. Allegre et al. [33], mapped 851 SNP markers corresponding to genes with putative functions and displaying a distinct polymorphic pattern across a selection of cacao germplasm. These SNPs are located within a gene expressed sequence and thus valuable for identifying candidate genes with functional roles in cacao. The data are available at http://tropgenedb.cirad.fr.
The sequencing of the cacao genome [3,29] should facilitate the identification of candidate genes for traits of interest, which are co-localized with marker-trait associations (MTAs).
Version 2.0 of the Criollo genome of Argout et al. [29] has 99% of the assembly anchored to the 10 chromosomes of the T. cacao L. genome. This resource was utilized in this study.
The objectives of this research were to exploit the naturally occurring genetic variation in a large collection of cacao trees of diverse origin, conserved ex situ, to perform GWAS to identify SNP markers significantly associated with phenotypic traits, examine linkage disequilibrium patterns on the chromosomes, and locate putative annotated candidate genes that encode yield-related and other traits. A genomic selection approach was then used to establish predictive values for phenotypic traits of interest in order to examine the efficiency of this breeding strategy to improve cacao yield.

Plant material
The germplasm studied consisted of four hundred and twenty-one (421) cacao accessions, including 263 wild genotypes (collected in the Amazon Basin [41]), which are conserved at the International Cocoa Genebank Trinidad (ICGT). Complete phenotypic data were collated for 346 of these accessions and incomplete data were accrued for 75 accessions (S1 29601.48484). The accessions represent 23 "accession groups", as described by Bekele et al. [41], as well as most of the genetic groups defined by Motamayor et al. [42], Wild cacao types such as those of the AMAZ, GU, IMC, MO, NA, PA, POUND, RB, SCA and SPEC  accession groups [41], which have evolved over a long period of time, were included to improve the power of detection of associations between SNP markers and phenotypic traits of interest as recommended by Stack et al. [40].
Plant material was collected from the 421 accessions at the ICGT. Sample collection spanned the period 1992-2012. The ICGT is situated at the University Cocoa Research Station, Centeno, North-East Trinidad at an altitude of 15 m above sea level. Shade is provided by trees of Erythrina sp. and bananas (Musa sp.) planted 6 m and 4 m apart, respectively. Each plot typically had 16 trees of a specific accession, which are planted 1.8 m apart. The soil type is Cunupia fine sandy clay with restricted internal drainage.
Over a 30-year period from 1981, the average temperature in this region was 26.3ºC. It was 26ºC for the period 1961 to 1991. This satisfied the optimal temperature requirements for growing cacao. The mean annual rainfall for 1981 to 2011 was 1945.2 mm, lower than the 2,392 mm recorded for 1961 to 1991 (Trinidad and Tobago Meteorological Office https:// www.metoffice.gov.tt). The trees received irrigation as necessary during the dry season (January-June) each year. The plots were maintained with standard production practices regarding weeding and pruning. However, no pesticides or fungicides were applied to allow monitoring of the natural reactions of the trees to pest and diseases and yield parameters. Fertilizers were applied at planting and regularly for only young trees and the trees were maintained within a low input system. during the same season to preclude the effect of seasonality on phenotypic trait expression. The fruits characterized were the products of open pollination. These data are available online in the International Cocoa Germplasm Database (ICGD) (http://www.icgd.rdg.ac.uk/).

Genotyping by sequencing
SNP markers that provide good coverage of the cacao genome [33] were employed in this study. The selection of 836 SNPs in coding sequences, which displayed significant similarity with known protein sequences, as described by Argout et al. [34], was carried out by Allegre et al. [33]. Illumina SNP genotyping was performed with the Illumina BeadArray platform at the French National Genotyping Centre (CNG, CEA-IG, Evry, France), according to the Gold-enGate Assay manufacturer's protocol. The genotype calling of each marker was verified using reference genotypes and filtered, as described by Argout et al. [34] and Allegre et al. [33]. The QualitySNP pipeline was used for detection of SNPs in the unigenes. All of the SNPs employed  [56]. In TASSEL, the Mixed Linear model (MLM) tested fixed effects to determine associations between genetic sites and phenotypes. FarmCPU was employed to verify the results generated with TASSEL MLM since it caters for random as well as fixed effects in the MLM, and removes false positive and negative associations [56]. MLM in TASSEL was used to correct covariances due to relatedness at the population level between genotypes (due to population structure) as well as co-ancestry (kinship) or identity by descent. The inclusion of the K (kinship) matrix allowed the inclusion of multiple backgrounds QTL as a random factor in the mixed linear model, as explained by Henderson [57]. The scaled, centred 'IBS' method, described by Endelman and Jannink [58], was used to generate the Kinship matrix of relationships among genotypes using TASSEL.
The statistical model used for the TASSEL MLM is as follows: Where Y is the vector of observations, β an unknown vector of fixed effects including genetic marker and population structure (Q); u is an unknown vector of random additive genetic effects from multiple background QTL for the individuals; X and Z are the known design matrices; and e is the unobserved vector of random residuals.
Each marker allele was fit as a separate class with heterozygotes fit as an additional class. The minor allele frequency (MAF) was set to > 0.05. SNPs with more than 14% missing values were removed and imputation was performed to replace any other missing values using the Euclidean distance measure in the k-nearest-neighbour algorithm [59] generated for the 10 nearest neighbours [55]. Phenotypic data for 75 accessions were imputed based on their genetic profiles.
The stringent Bonferroni correction test to cater for multiple testing [55] was applied to the derived P values to test the significance of the associations between traits and markers. Manhattan plots, generated in R [47], were also used to check for evidence of P value inflation and to identify significant MTAs based on TASSEL MLM.
Simple M in R [47] was used to determine the number of independent markers that should be used in testing to avoid the penalty of the stringent Bonferroni correction test, as prescribed by Gao et al. [60]. There were 664 SNP markers that were independent.
The LD pattern was tested with 100,000 permutations in TASSEL to obtain P values for the tests. The correlation between alleles at two loci, r 2 , and the standardized disequilibrium coefficient for determining whether recombination or homoplasy had occurred between a pair of alleles, D 2 , were derived. Fisher's exact test was calculated to compare alleles at any two loci. LD heatmap was used to scan for high linkage disequilibrium within chromosomes based on r 2 values. High LD was characterized by many red squares in the heatmap generated. Many red blocks together would be inferred as corresponding to haplotype blocks [61,62].
Loess regression was performed in R [47] to investigate LD decay by plotting the r 2 values as a function of genetic distance in megabases (Mb). The decay was plotted to show the points representing the distance on each chromosome, at which the mean value of r 2 decreased to half of the maximum value.
Quantile-quantile plots were also generated in R [47] to search for evidence of bias in the GWAS, such as due to genotyping artifacts, and to discern the extent to which the observed distribution of the test statistic followed the expected (null) distribution. The proportion of phenotypic variance explained by a marker was determined by the square of the partial correlation coefficient (R 2 ).
Genomic prediction. The predictive breeding value (GEBV) (along with the predictive error variance (PEV)) for each phenotypic trait was derived using ridge regression in TASSEL 5.2.50 software, which performed multiple correlation tests to compute correspondence between genotype and phenotype and accounted for collinearity to avoid bias. The genotypic data (737 SNPs in alpha format) were loaded, and kinship analysis was performed to generate an Identity by Descent (IBD) file. The phenotypes and K matrix were selected, and the genomic selection routine was run with 5-fold cross validation with 100 iterations.
Detecting candidate genes located within marker-trait association zones. The identification of putative candidate genes was done using the latest cocoa genome sequence (T. cacao Criollo genome version 2) [29] available on the Cocoa Genome hub (https://cocoa-genomehub.southgreen.fr/jbrowse). The biological functions of genes/transcripts close to the significant SNPs were determined by searching the SNP sequences with proteins related to seed development on the cocoa genome after referring to UniProtKB 2021 to identify such proteins. Flanking regions upstream and downstream of stable MTAs were selected to locate putative candidate genes. The lengths of the flanking regions were determined based on the linkage disequilibrium decay intervals on the respective chromosomes. For chromosomes 3, 5, 6 and 8, this was over a distance of 2.5 Mb upstream or downstream of the significant MTAs. It was over a distance of 5 Mb for chromosome 1, 1.6 Mb for chromosome 4, 0.87 Mb for chromosome 7 and 0.86 Mb for chromosome 9. A physical map, showing the position of annotated putative candidate genes that co-localised with SNP markers associated with traits of interest (MTAs), was then constructed using SpiderMap (Rami, 2007 unpublished, Spidermap v1.7.1, free software, CIRAD).

Phenotypic data analysis
The phenotypic data for the fully characterized accessions are presented in S2 . Tests of normality indicated significant deviation from normality for fruit length, width, fruit length to width ratio, total fresh seed mass, seed number, individual seed (cotyledon) mass and width, seed length to width ratio, pod index, ovule number, sepal length and width, and style length. Results of tests of normality performed on the natural log transformed values of fruit and seed quantitative traits indicated that natural log transformation corrected for the deviation from normality of these traits (S3 Correlation analysis revealed the interdependence of traits on one another. Of the 91 Pearson correlation coefficients, r, calculated for the quantitative traits, only 20 were not significant (P � 0.05). It is noteworthy that the yield-related traits, seed number, seed/cotyledon mass and seed dimensions, were highly correlated (P � 0.001). Correlations of the yield-related traits for 346 cacao accessions are presented in the Correlogram (Fig 2) Seed number was negatively correlated (r = -0.205 ��� ) with individual dried cotyledon mass. Seed/cotyledon length and width were positively correlated with individual dried seed/cotyledon mass (r = 0.688 ��� and

PLOS ONE
0.751 ��� , respectively). This suggests that the former traits can be used as indicators of the latter, a significant finding.
There was also a strong correlation (0.568 ��� ) between ovule number and seed number that justifies prediction of the latter using the former when fruits are unavailable. The correlations of fruit width with pod index and seed length, r = -0.621 ��� and 0.680 ��� , respectively, were also of interest.
Correlations for anthocyanin intensity in the various plant organs are presented in Fig 2. All of these correlations, based on the Spearman correlation coefficient (r s ), were significant (P � 0.05) except for the anthocyanin pigment concentration in the pedicel and anthocyanin pigmentation of the cotyledon (r = -0.097). It was notable that the correlations involving mature fruit surface (ridges) anthocyanin intensity and that in the ligule and filament of the flower and in the seed/cotyledon were all significantly (P � 0.05) negative (r = -0.150, -0.257, -0.147, respectively).

Population structure
Ten replicate runs of models using genotypic clusters (K) from 2 to 15 confirmed that K = 7 had the highest log-likelihood probability (log Pr (X | K) versus K) (Fig 3).
The population structure analysis revealed that 74% of the accessions could be stratified into seven sub-populations, while 26% could be regarded as admixtures. The constitution of the seven genetic clusters is presented in S6

Linkage disequilibrium (LD)
D 2 greater than 0.6, where D 2 of 1 represents the highest amount of disequilibrium possible, was indicative of recombination or homoplasy between pairs of alleles (Table 2). With regard to the mean squared correlations of allele frequencies, r 2 values, none were observed to be greater than 0.2 and the mean r 2 across the 10 chromosomes was 0.1 (Table 2). Chromosomes 1 and 3 had mean r 2 > 0.14. An average decay of r 2 to 50% (to r 2 = 0.1) over chromosomes 1, 4, 5, 7 and 9 occurred at a distance of 5.21 Mb (9 cM) (S7

Putatively robust marker-trait associations (MTAs)
The TASSEL MLM GWAS unravelled 17 significant associations. These were significant above the stringent Bonferroni threshold of P � 8.17 × 10 −5 . There was a predominance of positive associations between highly heritable qualitative traits such as fruit shape, anthocyanin intensity on fruit surface, fruit apex form, and flower filament anthocyanin intensity and SNP markers (Table 3, S8 Table http://dx.doi.org/10.13140/RG.2.2.34215.21927). However, there were also some significant associations between SNPs and quantitative traits. Two of the seventeen highly significant (P � 8.17 × 10 −5 ) MTAs, which were identified with TASSEL MLM, were corroborated at this level of significance with GAPIT FarmCPU (Table 3). Interestingly, the most significant (P � 1.15 × 10 −14 ) MTA based on the TASSEL MLM, which was observed for seed number and TcSNP 1335 on chromosome 7, and another involving TcSNP 1335 and log seed length on chromosome 7 were not detected when the FarmCPU model was used. However, FarmCPU revealed a significant MTA involving TcSNP 1126 and seed length on chromosome 7. The highly significant MTA involving log seed number and TcSNP 785 on chromosome 1 (P� 2.38 × 10 −05 ) ( Table 3) was also not verified with FarmCPU, but a MTA for TcSNP 67 and seed length was identified on chromosome 1 (P � 0.005). A comparison of the outputs of GWAS based on TASSEL MLM and GAPIT FarmCPU indicated 14 common MTAs involving yield-related traits at P � 0.003 (Table 3). Manhattan plots and associated Quantile-Quantile plots are presented for yield-related traits in Fig 6, based on TASSEL MLM.
The relatively low number of SNPs significantly associated with phenotypic traits, 2.4% of all SNPs used, was probably partly due to the stringent statistical thresholds applied in this study. Consequently, the results were carefully scrutinized just below the level of significance to discern MTAs, which may also have valid functional (biological) significance and to preclude false negative MTAs associated with TASSEL MLM. The results of the GAPIT FarmCPU analysis were confirmatory for 15 significant MTAs, including 14 yield-related traits, found

PLOS ONE
with TASSEL MLM. These were significant above the Bonferroni threshold in two cases and at P � 0.003 in the others (

Putative annotated candidate genes
Nine putative annotated candidate genes with functional roles related to seed development, lipid biosynthesis and transfer and carbohydrate transport were identified on chromosome 1 based on TASSEL MLM (Table 4). TcSNP 785, at position 31,785,158 bp on chromosome 1, was co-localized with gene Tc01v2_g025880, which is functionally significant since it encodes protein disulfide isomerase that may be required for proper pollen development, ovule fertilization and embryo development (Table 4). This finding was not corroborated with the Farm-CPU analysis, which only detected TcSNP 67 to be significantly associated with seed length on chromosome 1 (at position 35,469,159 bp). The functional protein co-localised with TcSNP 67 was 40S ribosomal protein S13. On chromosome 3, 11 putative candidate genes were detected with TASSEL MLM. They encode traits associated with seed development and lipid accumulation (Table 4).
There were two putative candidate genes involved with seed development co-localised with TcSNP 344 on chromosome 4, six that encode for seed development that were co-localised with TcSNP 667 (TASSEL MLM), eight responsible for seed protein and development and sugar transport that were linked to TcSNP 555, including Sugar transporter ERD6-like 16, and five with seed development functions co-localised with TcSNP 1160 (TASSEL MLM) ( Table 4). In addition, seven genes involved with lipid formation and seed development, such as Acyltransferase-like protein At3g26840%2C chloroplastic, Glycerol-3-phosphate dehydrogenase, and Stearoyl-[acyl-carrier-protein] 9-desaturase, were localised close to TcSNP 953 on chromosome 4. TcSNP 953 (position 2,822,152 on chromosome 4) is located 3.7 Kb upstream of the gene that encodes acyl transferase-like protein At3g26840. Putative E3 ubiquitin protein ligase DRIP2, which acts as a negative regulator of the response to water stress was also colocalised with SNP 555 on chromosome 4 (Table 4).
On chromosome 5, seven putative candidate genes were co-localised with TcSNP 733. These were all involved in seed development. Of particular interest was Glycerol-3-phosphate 2-O-acyltransferase 6, which is rate-limiting enzyme in the de novo pathway of glycerolipid synthesis (Table 4). One candidate gene encodes soluble inorganic pyrophosphatase 4, which is important for development, but also in stress resistance (including to cadmium ion response) in plants (https://www.uniprot.org/uniprot/Q9LFF9). The latter was not of functional significance in this study, but is important for optimised cocoa production systems. In addition, one functional candidate gene involved with seed development was co-localised with TcSNP 823 on chromosome 5, based on TASSEL MLM (Table 4). Three putative candidate genes were found on chromosome 7, one of which was a sugar transporter. Likewise, three functional candidate genes with seed development roles were detected on chromosome 8 although the MTA, involving pod index and SNP 642, was below the stringent Bonferroni threshold (P � 4.39 × 10 −04 ) based on TASSEL MLM. The latter MTA was verified and significant at P � 1.02 × 10 −3 based on the FarmCPU analysis (Table 3).  TASSEL MLM revealed a MTA of functional significance on chromosome 9, albeit below the Bonferroni threshold. TcSNP 184, located at 10,934,938 bp on chromosome 9, was colocalized with four putative candidate genes, one of which encodes Zinc finger protein CON-STANS-LIKE 5. The latter is involved in the regulation of flower development and regulation of transcription. The most significant MTA (P � 9.57 × 10 −05 ) on chromosome 9, based on the FarmCPU analysis, involved TcSNP932, at position 1,479,475 bp, and log cotyledon mass (Table 3). It was 14.3 Kb downstream of 1-acyl-sn-glycerol-3-phosphate acyltransferase 1%2C chloroplastic, which has biological significance in this study.
It must also be noted that TcSNP 401 (20,485,872), located 199 Kb upstream of Tc00_t058610, which encodes a putative MYB-related protein 308, 2 Mb upstream of the gene, Tc04v2_g008890, which encodes a putative MYB family transcription factor, and 1.3 Mb upstream of the gene, Tc04v2_g009300, which encodes a MYB domain protein 20, was significantly associated with fruit surface (ridge) anthocyanin intensity on chromosome 4, based on the TASSEL and FarmCPU analyses. A significant association (P � 10 × 4.57 −05 ) (TASSEL MLM) was also found between fruit surface anthocyanin intensity and SNP 644 on chromosome 9 (S8 With regard to floral traits, there was a highly significant (P � 3.81 × 10 −05 ) MTA, which was detected between sepal length and TcSNP 1334 on chromosome 3 when TASSEL MLM was performed (S8 Of the 17 significant MTAs identified with TASSEL MLM, there were two for fruit shape oblate (an uncommon phenotype in this diverse cacao germplasm sample), and orbicular on chromosome 3. These MTAs involved TcSNPs 1353 and 1477. Interestingly, loci controlling fruit shape were dispersed over several chromosomes, representing independent linkage groups.
In summary, about 40 putative candidate genes of functional importance were identified in this study. These included those that encode protein precursors, carbohydrate transport, lipid synthesis/bioassembly, binding and metabolism, lipid transfer and seed storage, endosperm development and regulation of seed growth, embryo development leading to seed dormancy, seed development/morphogenesis and regulation of flower and pollen development and ovule fertilization as well as responses to water deprivation, cadmium contamination and other abiotic stresses and biotic (disease) stresses (Table 4).
• Chromosome "11" was designated for unmapped SNP markers (some of which have recently been mapped).
• X-and Y-axes represent the SNP markers along each chromosome and the -log10(P-value), respectively.
• The red horizontal line corresponds to the Bonferonni significance threshold of P-values � 8.17 × 10 −5 (-log10 (P) = 4.088) and the blue line corresponds to a significance level of 0.005. Fig 6 Quantile-quantile plots of estimated−log10 (P) from genome-wide association studies using TASSEL MLM. Quantile-quantile plots of estimated−log10 (P) for filament anthocyanin intensity; Quantile-quantile plots of estimated−log10 (P) for fruit surface (ridges) anthocyanin intensity; Quantile-quantile plots of estimated−log10 (P) for log fruit length; Quantile-quantile plots of estimated−log10 (P) for log seed length; Quantile-quantile plots of estimated−log10 (P) for log seed number; Quantile-quantile plots of estimated−log10 (P) for seed length to width ratio; Quantile-quantile plots of estimated−log10 (P) for seed number. The plots provide no evidence of bias in the GWAS, such as due to genotyping artifacts, and display the extent to which the observed distribution of the test statistic followed the expected (null) distribution. The red line represents expected P-values with no associations.

Genomic prediction value of traits
The genomic estimated breeding values (GEBV) of the phenotypic traits studied are presented in S9 Of the qualitative traits studied, fruit basal constriction and filament anthocyanin intensity had predictive values greater than 0.5. The quantitative traits, seed number, seed mass, seed length, seed width, seed length to width ratio, pod index, fruit wall hardness, ovule number and fruit width had GEBV values greater than 0.5. The yield-related traits, seed number and dried seed (cotyledon) mass had GEBV values of 0.611 and 0.6014, respectively. The GEBV value of cotyledon length and width, indicators of seed size, were 0.6199 and 0.5435, respectively. Interestingly, ovule number had the largest GEBV value of 0.6325. It is regarded as a reliable predictor of seed number, which is dependent on successful pollination.

Linkage disequilibrium patterns
Our findings, based on the linkage disequilibrium analyses, suggest that MTAs could be localized in the cacao genome with relatively high precision particularly when wild cacao genotypes are studied. This concurs with the observations of Stack et al. [40], who observed that the wild genotypes (such as those from the Purús, Contamana, Curaray, Iquitos, Nanay, Marañón and Guiana genetic groups) exhibited very low overall LD. In our study, an average decay of r 2 to 50% over chromosomes 1, 4, 5, 7 and 9 occurred at a relatively short distance of 5.21 Mb (9 cM) on average (S7

PLOS ONE
per fruit in diverse cacao germplasm should prove valuable for future genomics-assisted breeding for yield improvement. The relatively small effect size of the markers associated with yield-related traits, in this study, where none of the markers explained more than 20% of the phenotypic variation expressed, is not unusual for quantitative traits. Most of the markers studied explained 5 to 11% of the phenotypic variation expressed.
Genetic variation of quantitative (polygenic, continuous) traits such as yield and disease resistance are controlled by the combined effects of QTL, epistasis (interactions between QTLs) [14], the environment and interaction between environment and QTL [73]. The use of only biallelic subsets of SNPs, in this study, could have excluded multiallelic loci, which may have contributed to additional variance expressed in the study population for polygenic traits, such as those related to yield potential. Mir et al. [74] described yield as a very complex quantitative trait that is controlled by a network of minor genes. For such polygenic traits, with small effect size, increasing the sample size of the study population and densely sampling a population that shows phenotypic diversity should improve the power to detect meaningful associations [75].
A phenomenon whereby significant associations were found in this study between certain traits, such as fruit anthocyanin intensity, fruit shape and seed length to width ratio and seed number, at different loci (Table 3) was of interest. It suggests that a large part of trait variance was explained by several marker-trait associations, as has been observed in other studies [73]. In this study, there were 26 SNPs significantly associated with fruit shape, two on chromosome 1, seven on chromosome 2, nine on chromosome 3, three on chromosome 5, two on chromosome 7 and three on chromosome 8, based on the results of TASSEL MLM. It seems justifiable to hypothesize that minor genes as well as major genes affect fruit shape. The oblate shape is a trait associated with certain wild types, which have evolved over a long period of time. A wellknown cacao accession with this trait is CATONGO.
The presence of markers significantly associated with different traits, in the same genomic region, was also observed in this study, based on the TASSEL MLM. The latter traits were seed number and orbicular shape on chromosome 7 (SNP 390) and seed number and seed length, also on chromosome 7 (SNP 1335) (Tables 3 and 4). These associations may represent the occurrence of linked genes, each one coded separately for a specific trait [11] or may indicate co-localization of the respective markers with a gene or gene block with putative pleiotropic effect [76]. In the case of seed number and seed length, this apparent linkage is striking since pod index (a measure of yield potential) is derived from seed number and dried individual cotyledon mass. The latter was found to be significantly correlated with seed dimensions (length and width) in this study (Fig 2). The relevant associated putative genes may have adaptive influence due to linkage mediated by selective forces [77]. The likelihood of such a phenomenon being observed during this study was feasible due to the inclusion of at least 48 cultivated accessions, including 28 Imperial College Selections (ICS) (S1  2.16179.71202). The latter reportedly evolved over a period of more than two hundred years in Trinidad and Tobago and were selected based on large seed size and seed number and favourable yield, among other selection criteria. It must be underscored that pleiotropic markers may facilitate simultaneous selection of the multiple traits with which they are significantly associated and thus gene pyramiding.

Putative annotated candidate genes for yield-related traits
The search for candidate genes in this study was guided by the fact that the storage compounds of cacao seeds are starch, lipids (fats) and storage proteins [78]. Bucheli et al. [79] investigated the variation of sugars, carboxylic acids, purine alkaloids, fatty acids, and endoproteinase activity during maturation of cacao seeds. Aspartic endoproteinase activity was observed to increase rapidly during seed expansion and a major change in the fatty acid composition occurred in the young embryo. Mustiga et al. [80] detected a major QTL explaining 24% of the relative level of palmitic acid on the distal end of chromosome 4, located close to the Thec-c1EG017405 gene. The latter is an orthologue and isoform of the stearoyl-acyl carrier protein (ACP) desaturase (SAD) gene that is involved in fatty acid biosynthesis. Cacao seeds also contain a vicilin-like globulin, a seed storage protein [81].
Triacylglycerols (TAGs) are the major storage lipids in several plants and serve as energy reserves in seeds that are later used for germination and seedling development [84,85]. The terminal step in TAG formation in plants involves the catalytic action of diacylglycerol acyltransferase (DGAT) in the presence of acyl-CoA [84]. Developing seeds in Brassica napus have been reported to produce Diaylglycerol (DAG) during the active phase of oil accumulation [86].
The proteins encoded by candidate genes, which were co-localized with SNPs found to be significantly associated with yield-related traits, during this study (Table 3), are presented in Table 4. An association (just below the stringent significance level with Bonferroni correction) observed between seed length (TASSEL MLM and FarmCPU) and seed cotyledon mass (Farm-CPU) and TcSNP 953 on chromosome 4, at a position of 2,822,152 bp, 3.7 Kb upstream of a gene that encodes diacylglycerol acyltransferase (Acyltransferase-like protein At3g26840%2C chloroplastic) (Table 4), is among the most noteworthy of this study and warrants further investigation with expression studies.
At3g26840 is involved in seed storage (globulins) and seed size in Assembly cotton 46 (Gossypium spp.) [82]. Gossypium spp. are related to cacao, both being members of the Malvaceae family. The discovery of genes that encode Glycerol-3-phosphate dehydrogenase, and Stearoyl-[acyl-carrier-protein] 9-desaturase on chromosome 4 was also of relevance. The association revealed by FarmCPU between TcSNP 932, on chromosome 9, with log cotyledon mass, must be considered due to its functional importance. This SNP was co-localised with 1-acylsn-glycerol-3-phosphate acyltransferase 1%2C chloroplastic, which is involved in lipid metabolism.
Another putative candidate gene, unravelled during this study, was Tc01v2_g022850, which encodes Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein involved in Glycerolipid metabolism (chromosome 1, 2.5 Mb downstream of SNP 785 (31,785,158) on chromosome 1) ( Table 4). It is must also be recorded that TcSNP 555, on chromosome 4, was co-localised with vicilin in this study.

Consensus MTAs involving yield-related traits in cacao
Since several hundred MTAs or QTLs have previously been identified in cacao, an attempt was made to check for congruence between the findings of this study and previous ones with respect to common traits. Highly significant (stable) associations between yield-related traits, seed length and seed length to width ratio, seed number and seed (cotyledon) mass and pod index, and SNPs were found on chromosomes 1, 3, 4, 5, 7 and 9 in this study ( Table 3). The findings from previous studies that coincide with those of this study are highlighted below. Eight significant associations between SSR markers and yield-related traits were reported by Marcano et al. [87]. These included associations with fresh seed mass on chromosomes 1, 2, 5, 6, 9 and 10, MTAs involving dried seed mass (100 seeds) on chromosomes 2, 4, 9 and 10 as well as one marker associated with seed number per fruit on chromosome 5. In their mapping study, dos Santos Fernandes et al. [88] identified several QTLs flanked by the markers on chromosome 4, which were associated with pod index, dried individual seed mass, number of fruits harvested and number of healthy fruits harvested. They also identified a significant association between the marker, Tcm002s23708704, and pod index on chromosome 2. dos Santos Fernandes et al. [88] unravelled 13 candidate genes linked to yield (dried seed mass, pod number), on chromosomes 4 and 2. Nine of these genes are annotated as transmembrane transporters, specializing in sugar transport, two genes are involved in carbohydrate metabolism, one gene codes for lipid metabolism, and one is involved in glucose metabolism [88]. Motilal et al. [64] identified three SNPs (TcSNP 368, 697, 1370) on chromosomes 1 and 9 that were significantly associated with seed number. Clément et al. [20] found two QTLs for yield in the clone, POUND 12, located close to a QTL for yield, identified in IMC 78, on chromosome 4.
Previous studies thus commonly observed loci on chromosomes 1, 2, 4, and 9 to be associated with seed mass and dimensions in cacao [87]. The findings of this study suggest that yield-related traits are associated with loci on chromosomes 1, 3, 4, 5, 7, 8 and 9, putatively linked to functional genes (Tables 3 and 4). Two SNPs, TcSNP 344 and 953, were associated with seed length on chromosome 4 (Tables 3 and 4). TcSNP 953 is located at the top of chromosome 4 (Fig 7) and thus the MTA, though below the stringent Bonferroni level of significance, may be considered validated since it was observed in a common region where yieldrelated MTAs were located in the studies by dos Santos Fernandes et al. [88] and Clément et al. [20].
A highly significant MTA, observed in this study, involved seed number and TcSNP 785 on chromosome 1 (TASSEL MLM). Motilal et al. [64] also reported such an MTA on chromosome 1.
Criollo, Forastero and Trinitario are widely recognised classes of cocoa in the trade. The cacao accessions in this study that displayed favourable yield potential were mainly Trinitario (cultivated germplasm) [44] and those with Criollo ancestry. This was due mainly to their large seed sizes (Table 5). This supports the deduction of Doebley et al. [89] that "cultivated species generally have larger fruits or seeds compared to their wild ancestors, indicating that fruit and seed size are major agronomic traits that have been selected in crops during their domestication." However, several Upper Amazon Forastero types, in this study, also had favourable (low) pod index due to their large seed numbers (

Improving yield potential with MTAs unravelled
Based on the findings of this study, elucidation and selection of genotypes associated with large seed size in T. cacao may thus be facilitated by using TcSNP 953, TcSNP 555 and TcSNP 344 on chromosome 4 and TcSNP 733 on chromosome 5 (Tables 3 and 4). The results of a preliminary evaluation to detect genotypes associated with favourable yield potential (pod index) are presented in Table 5 and involve TcSNP 555, TcSNP 642 (on chromosome 8) and TcSNP 953. Further investigation with haplotype inferencing for genomic prediction and involving training and test populations of T. cacao L. in genomic selection studies, as described by Bhat et al. [90], is recommended.
Additional studies to investigate functional genomics associated with yield-related traits in cacao are also recommended. Such studies, in wheat, have revealed transcription factors, which can affect seed number, genes involved in metabolism or signalling of growth regulators, genes determining cell division and proliferation related to seed size, and floral regulators that regulate inflorescence architecture and seed number. Genes involved in carbohydrate metabolism, affecting plant architecture and grain yield such as trehalose phosphate synthase (TPS) and trehalose phosphate phosphatase (TPP) genes have also been identified [91].
These recommended follow-up studies would entail gene expression analysis, involving transcriptomics as described by Jako et al. [82]. The DGAT gene, Tag1, from Arabidopsis was shown to encode an acyl-CoA-dependent DGAT. Jako et al. [82] also demonstrated that seedspecific over-expression of the DGAT cDNA in wild-type Arabidopsis "enhances oil deposition and average seed mass, which are correlated with DGAT transcript levels", and that "DGAT has an important role in regulating the quantity of seed TAGs, the sink size in developing seeds and thus seed size." They also demonstrated that "over-expression" of the acyl-CoAdependent DGAT in a "seed-specific manner in wild-type Arabidopsis plants results in increased oil deposition and average seed mass." Seed size has also been shown to be directly determined by carbohydrate import into seeds, in maize and rice, and involves SWEET-mediated hexose transport [92]. SWEET genes regulate the transport, distribution and storage of carbohydrates in plants, and are involved in many important physiological processes, including phloem loading, reproductive development, disease-resistance, stress response, and host-pathogen interaction. In this study, SWEET17 was localized on chromosome 4 upstream of TcSNP555 (Table 4), which was associated with pod index (P � 4.29 × 10 −4 ). SWEET 2 was also localized downstream of TcSNP1335, which was significantly associated (P� 1.15 ×10 −14 ) with seed number as well as with log seed length (P� 6.75 × 10 −05 ) on chromosome 7 (TASSEL MLM) ( Table 4).
Another significant association (P � 6.79 × 10 −05 ) for fruit surface anthocyanin intensity was found in this study with TASSEL MLM analysis. It involved TcSNP 1203, on chromosome 3 (located at 563,101 bp), which accounted for 5.6% of the phenotypic variation in fruit surface anthocyanin concentration in this cohort of germplasm.
Three regions associated with pigmentation on different organs in cacao have been identified by Marcano et al. [87]. This sector includes the major locus identified by Crouzillat et al. [12] as responsible for controlling 'seed colour' in Catongo x POUND 12 backcross progeny.
MYB proteins are involved in regulatory networks controlling metabolism, including the synthesis of anthocyanins, responsible for the red pigmentation in cacao [93,94]. Liu et al. [95] observed that overexpression of the Tc-MYBPA gene elicited increased expression of several genes encoding the major structural enzymes of the proanthocyanidin and anthocyanidin pathway in cacao.
There were also significant associations between filament anthocyanin concentration and SNPs on chromosomes 3 and 8 (TcSNPs 1183 and 1441, respectively), based on the TASSEL MLM results of this study ( All of the MTAs involving anthocyanin concentration in this study suggest multi-gene control of anthocyanin intensity in the mature fruit epidermis and flower filaments of cacao. Furthermore, the differential expression of pigmentation in the seeds and fruits of cacao, observed at the ICGT, as evidenced in the negative correlations in Fig 2, and as stated by Bartley [96], may be explained by the association of pigmentation of seeds and fruit pigmentation with several genomic regions.
MTAs involving anthocyanin intensity warrant further investigation to determine their value for genomic selection due to the significance of this trait in differentiating among certain genotypes of interest, such as CCN-51 and ancient Nacional cacao [70,94], the nutraceutical value of anthocyanin and its putative role in cacao disease resistance [34,95].

Highlights of this study
The results of this study support the observation of Rockman [97] that most complex traits (such as those related to yield in cacao) are controlled by several (putatively interacting) loci with small effects. While some phenotypic traits are controlled by a small number of loci with large effects (as is often the case for traits under biotic selection) [74], others, such as yield, may have more complex genetic architectures. The latter may be controlled by many rare variants, each having a large effect on the phenotype or conversely, many common variants with only small effects on the phenotypes [98]. The causative variants may be clustered in one or a small number of genes or across many genes.
The data on gene ontology presented in Table 4 provide evidence of polygenic control of yield-related traits in cacao. For such traits, it may be more effective to predict the performance of genotypes by using multiple molecular markers [98]. Multilocus mixed linear models (MMLMs) may thus be considered for future studies in cacao, when complex traits are investigated, because these incorporate multiple markers simultaneously as covariates in a stepwise MLM [99].
The observations regarding drought tolerance in this study are of considerable significance since the conditions at the ICGT, where these accessions were conserved, are considered suboptimal in terms of soil moisture content [44]. Therefore, the latter results, which indicate

PLOS ONE
drought adaptation among genotypes with inherent high yield potential, merit follow-up functional genomic studies.
Once the functional roles of putative genes co-localized with markers with significant associations to traits of interest are elucidated, the effects of relationships of these putative genes with geography and local adaptation must be established, as recommended by McKown et al. [100]. Micheli et al. [101] have reported on functional genomics in cacao focusing on genes expressed under specific physiological conditions. Consistency in QTL effects over different genetic backgrounds must also be established. Individuals identified with favourable marker genotypes and haplotypes may then be used as parental types for enhancement or breeding programmes, such as those involving Multiparent Advanced Generation Intercross Populations (MAGIC), in targeted environments.
Despite the fact that yield is a complex trait, our results on potential genomic selection (GS) for yield traits are very promising, given the high predictive values obtained for these traits, generally superior to 0.5. The detection of several markers associated with yield-related traits with good predictive value, in this study, could facilitate GS and marker-assisted selection (MAS) for yield in cacao.

Conclusion
In this study, carefully collated phenotyping data on traits of economic interest in cacao, such as yield potential (S2 , generated via transcriptome sequencing, were subjected to GWAS. Twenty-nine cacao accessions represent promising material for breeding in terms of yield potential. The results presented herein indicated oligogenic and polygenic control of yield-related traits in cacao. The stringently significant marker-trait associations related to yield, found in this study, were indications of the presence of QTL on chromosomes such as 3, 4, 5, 7, 8 and 9. Chromosome 4 putatively contains a QTL cluster associated with yield-related traits. Some annotated candidate genes associated with seed development, seed lipid accumulation, metabolism and development and plant stress responses (including to drought and to cadmium) have been identified during this research. The identification of yield-related traits with good predictive value could further facilitate GS for yield potential in cacao. The performance of the non-phenotyped individuals at the Trinidad genebank (ICGT) could also be predicted once they are genotyped. This would be particularly useful for the enhanced genotypes (GEBP progeny) [45]. Identification of parents possessing high predictive values and favourable alleles and haplotypes, prior to crossing, should prove beneficial for more rapid development of enhanced cacao progenies.
Supporting information S1